flowchart LR
START[(HIV+ blood samples)]
M1[???]
M2[???]
M3[???]
END[(TSI estimates)]
START --> M1 --> M2 --> M3 --> END
2023-11-12
incidence estimation
identifying sub-populations in which recent infections are occurring.
determining generation intervals.
but getting data is hard:
longitudinal cohorts as gold standard
alternatives which are discussed in this section
This is great, but specificity and sensitivity are hard to control:
Laeyendecker et al. (2013)
Ragonnet-Cronin et al. (2022) developed a recency classifier
comments on model:
Ragonnet-Cronin et al. (2022)
Golubchik et al. (2022) tackle weaknesses of the previous approach:
sequences from PANGEA and BEEHIVE with known infection range.
choice of predictors based on viral diversification following infection.
different signals in different parts of the genome. How to combine predictors?
Best performing set of features chosen through LOOCV
Regressor performs well as a classifier too, but with a false recency rate (FRR) of ~10%
For applications:
model bias was close to 0 for all subtypes included in the dataset.
Some individuals in the training data reported prior ART use.
followed pattern
BUT: small sample size
AND: no predictions for suppressed individuals.
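The classifier use of the regressor mentioned above can be sketched as follows: predicted TSI is thresholded, and the FRR is the share of truly long-term infections that fall below the threshold. This is a minimal illustration, not the evaluation code of Golubchik et al. (2022); the function name, the 1-year threshold and the toy values are all assumptions.

```python
def false_recency_rate(true_tsi, predicted_tsi, threshold_years=1.0):
    """Share of long-term infections (true TSI >= threshold) predicted as recent.

    The 1-year default threshold is an assumption made for illustration.
    """
    long_term = [(t, p) for t, p in zip(true_tsi, predicted_tsi)
                 if t >= threshold_years]
    if not long_term:
        raise ValueError("no long-term infections in the data")
    false_recent = sum(1 for _, p in long_term if p < threshold_years)
    return false_recent / len(long_term)


# Toy values in years (hypothetical, not from the paper).
true_tsi = [0.3, 0.8, 1.5, 2.0, 4.0, 6.0]
pred_tsi = [0.5, 1.2, 0.9, 2.4, 3.1, 5.0]
print(false_recency_rate(true_tsi, pred_tsi))
```

Moving the threshold changes which infections count as recent, which is the extra control over the recency definition that a regressor offers compared with a fixed assay.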
flowchart LR
START[(HIV+ blood samples)]
M1[???]
M2[???]
M3[???]
END[(TSI estimates)]
START --> M1 --> M2 --> M3 --> END
Main steps:
*.bam, *_ref.fasta produced by shiver
The figure was generated with AliView
The figure was generated with FigTree
phyloscanner Wymant et al. (2017) summarizes each tree through summary statistics:
patStats.csv: contains LRTT, number of tips, etc…
MAF12c) and/or third codon position (MAF3c)
graph LR
subgraph S1["Primary Data Patients"]
direction TB
I01[(*.bam)]
I02[(*_ref.fasta)]
I02b[(*BaseFreqs_WithHXB2.csv)]
end
subgraph S2["Reference Sequences"]
I03[(*.fasta)]
end
subgraph S3["Multiple Sequence Alignment"]
P1(mafft)
O1[(*.fasta)]
end
subgraph S4["Constructing Phylogenies"]
direction TB
P2(IQTREE)
O2[( *.iqtree )]
end
subgraph S2b["Computing MAFs"]
direction TB
P2b(script)
O2b[(maf.csv)]
end
subgraph S5["Analysing phylogenies"]
direction TB
P3( PhyloscannerR )
O3[(*PatStats.csv)]
end
subgraph S6["Obtain TSI estimates"]
direction TB
P4( HIV-phyloTSI )
O4[(TSI.csv)]
end
classDef Red fill:#F84E68
class S3,S4 Red
classDef Orange fill:orange
class S5,S6 Orange
I02 --> P1
I01 --> P1
I02b --> P2b --> O2b --> P4
I03 --> P1
P1 --> O1
O1 --> P2
P2 --> O2
O2 --> P3
P3 --> O3
O3 --> P4
P4 --> O4
Note
Steps run for each group and window are shown in red; steps run once per group are shown in orange.
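The "Computing MAFs" step in the diagram can be illustrated with a small sketch. It assumes that the MAF at a site is one minus the frequency of the most common base, and that sites are indexed in frame so that every third site is a third codon position. Both are assumptions made for illustration; the script used in the workshop may define and aggregate these quantities differently, and the function names are hypothetical.

```python
from statistics import mean


def minor_allele_frequency(base_counts):
    """MAF at one site, taken here as 1 - (count of most common base / depth)."""
    depth = sum(base_counts.values())
    if depth == 0:
        return None  # no coverage at this site
    return 1.0 - max(base_counts.values()) / depth


def mean_maf_by_codon_position(counts_by_site):
    """Average MAF over codon positions 1-2 (MAF12c) and position 3 (MAF3c).

    counts_by_site maps a 1-based, in-frame site index to {base: count}.
    Treating site 1 as the first base of a codon is an assumption.
    """
    groups = {"MAF12c": [], "MAF3c": []}
    for site, counts in counts_by_site.items():
        maf = minor_allele_frequency(counts)
        if maf is None:
            continue  # skip uncovered sites
        groups["MAF3c" if site % 3 == 0 else "MAF12c"].append(maf)
    return {name: mean(vals) if vals else None for name, vals in groups.items()}


# Toy base counts for one codon (hypothetical values).
toy = {1: {"A": 90, "G": 10}, 2: {"C": 100}, 3: {"T": 60, "C": 40}}
print(mean_maf_by_codon_position(toy))
```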
Input: divergence measures which are differently informative depending on position on the genome.
ML algorithm to capture complexity:
There are two tools that will make our lives easier:
The ‘ingredients’ needed to run the analyses are the ML algorithm, the data, and the code dependencies. The code chunk below downloads everything needed for the analyses.
# change directory to where you want to install HIV-phyloTSI repo
cd $HOME/git # this is where I install git packages
# cd $HOME && mkdir git && cd git
# clone directories necessary to run the analysis
# BDIs code and workshop materials
git clone git@github.com:BDI-pathogens/HIV-phyloTSI.git
git clone git@github.com:abriz97/HIV-phyloTSI-workshops.git
# store paths to 2 above directories
DIR_WORKSHOP="$(pwd)/HIV-phyloTSI-workshops"
DIR_PROGRAM="$(pwd)/HIV-phyloTSI"
# install python dependencies for HIV-phyloTSI and load the environment
conda env create -f HIV-phyloTSI-workshops/hivphylotsi.yml
conda activate hivphylotsi
Note
When interacting with a terminal, learning the power of the $ operator is key. It expands variables (e.g. $HOME) and, when followed by parentheses, substitutes the output of a command (e.g. $(pwd)).
Once all the ingredients are there, we can start cooking. It is relatively simple to run the analyses, even though we need to be precise in the way we specify the paths to the input data.
# Run HIV-phyloTSI on input data.
python $DIR_PROGRAM/HIV-phyloTSI.py \
-d $DIR_PROGRAM/Model \
-p $DIR_WORKSHOP/HIV-phyloTSI-workshops/input/ptyr1_patStats.csv \
-m $DIR_WORKSHOP/HIV-phyloTSI-workshops/input/phsc_input_samples_maf.csv \
-o $DIR_WORKSHOP/HIV-phyloTSI-workshops/output/ptyr1_tsi_workshop.csv
# print header of output to make sure it exists:
head $DIR_WORKSHOP/HIV-phyloTSI-workshops/output/ptyr1_tsi_workshop.csv
Note
The first 2 lines point to $DIR_PROGRAM because they refer to the code we want to use. On the other hand, the bottom 3 lines refer to the input data and output paths, and this is why they point to $DIR_WORKSHOP.
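Since the run depends on every path being specified precisely, a quick check before launching can save a failed run. The sketch below lists which of a set of paths do not exist; the function name and the example paths are hypothetical and should be replaced with the actual values passed to -d, -p and -m.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the given paths that do not exist on disk, in input order."""
    return [str(p) for p in paths if not Path(p).exists()]


# Hypothetical paths: substitute the real -d, -p and -m arguments.
to_check = ["HIV-phyloTSI/Model", "input/ptyr1_patStats.csv"]
for path in missing_inputs(to_check):
    print(f"missing: {path}")
```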
I provide some R functions and scripts to visualise results, which can be found in the GitHub repository:
$DIR_WORKSHOP/src/workshop_analyses.R
$DIR_WORKSHOP/src/R/workshop_R_helpers.R
Again, we can use conda to install the necessary packages:
I will be showing snippets of the above code together with the plots they produce.
You can reproduce the steps by opening up the script in RStudio.
Following Freeman and Hutchison (1980) and Brookmeyer and Quinn (1995), if incidence is constant, it can be estimated as:
\[ I = \frac{\text{# recent}}{\text{# uninfected} \times MDRI } \]
where \(MDRI\) = mean duration of recency of infection.
Here we focus on the numerator.
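As a worked illustration of the formula above, the sketch below plugs counts into the estimator. The function name and the numbers are hypothetical, and MDRI is assumed to be expressed in years so that the result is a rate per person-year.

```python
def incidence_estimate(n_recent, n_uninfected, mdri_years):
    """Cross-sectional incidence: I = n_recent / (n_uninfected * MDRI)."""
    if n_uninfected <= 0 or mdri_years <= 0:
        raise ValueError("n_uninfected and mdri_years must be positive")
    return n_recent / (n_uninfected * mdri_years)


# Hypothetical survey: 30 recent infections, 5000 uninfected, MDRI of 0.5 years.
rate = incidence_estimate(30, 5000, 0.5)
print(f"{rate:.4f} infections per person-year")
```

The estimate scales linearly with the number of infections classified as recent, which is why misclassification in the numerator matters.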
source: slides by Laeyendecker
In simulated settings, the prevalence of recent infections is well estimated, even though at least 2% of recent classifications are wrong.
Golubchik et al. (2022)
When directly applied to real-world data, HIV-phyloTSI generally produces smaller TSI estimates
for known recent infections than for people with an unknown first positive date:
=> Can be used to compare median TSIs among population subgroups.
HPTN 071-02 Phylogenetics ancillary study to the HPTN 071 (PopART): samples from HIV-positive participants in 9 communities in Zambia 2014-2019
Comparing simple summary statistics can be misleading:
For reliability:
Bootstrapping: statistical technique which resamples the analysis data with replacement to estimate the uncertainty around an estimator (e.g. the median).
code here
Groups should be defined by covariates other than the HIV-phyloTSI inputs or outputs.
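The bootstrapping step described above can be sketched with a percentile bootstrap for the median TSI of one group. This is a minimal illustration using only the standard library, not the workshop's R code; the function name, the number of resamples and the toy TSI values are assumptions.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median of `values`.

    Returns (sample median, lower bound, upper bound).
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_boot)
    )
    lower = medians[int(alpha / 2 * n_boot)]
    upper = medians[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.median(values), lower, upper


# Toy TSI estimates in years for one subgroup (hypothetical values).
tsi = [0.4, 0.9, 1.2, 2.5, 3.1, 0.7, 5.0, 1.8, 2.2, 0.3, 4.1]
median, lower, upper = bootstrap_median_ci(tsi)
print(f"median TSI {median:.1f} years, 95% interval [{lower:.1f}, {upper:.1f}]")
```

Running this per subgroup and comparing the intervals, rather than the medians alone, gives a rough sense of whether an observed difference could be due to sampling noise.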
Do NOT:
Estimation of generation time distribution
Enriching source-recipient pairs by providing time since infection.
Important
To account for individual-level uncertainty, these studies use not only the central estimates but also the output prediction (uncertainty) range.
Monod et al. (2023)
Infection pairs data from Rakai Community Cohort Study
Transmission pairs detected with phyloscanner Ratmann et al. (2020)
Question: how did transmission patterns change over time?
Need to date infections
Monod et al. (2023)
Monod et al. (2023)
HIV-phyloTSI is a novel algorithm to estimate infection dates.
an alternative to serological assays that allows more control over the definition of recency (robust to subtype and ART usage)
preliminary analyses and simulation studies demonstrate good performance at the population level
but may be inaccurate at the individual level: very large prediction intervals.
Can help us explore answers to unanswered questions